Skip to content

Conversation

behzad-mir
Copy link
Contributor

@behzad-mir behzad-mir commented Jul 14, 2025

Adding a CNI Telemetry sidecar to CNS pod to replace the azure-vnet-telemetry binary.

Requirements:

Notes:

@Copilot Copilot AI review requested due to automatic review settings July 14, 2025 16:17
@behzad-mir behzad-mir requested a review from a team as a code owner July 14, 2025 16:17
@behzad-mir behzad-mir requested a review from csfmomo July 14, 2025 16:17
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the CNI telemetry service initialization issue where it was failing to start due to an empty Application Insights instrumentation key. The fix ensures that the telemetry service uses the proper AI configuration from the CNS configuration instead of an empty basic configuration.

  • Updates the startTelemetryService function to accept CNS configuration and properly initialize AI telemetry settings
  • Adds validation to check for Application Insights instrumentation key before starting the telemetry service
  • Introduces proper telemetry configuration mapping from CNS settings to AI config

@@ -94,6 +94,7 @@ const (
// Service name.
name = "azure-cns"
pluginName = "azure-vnet"
aiPluginName = "AzureCNI"
Copy link
Preview

Copilot AI Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a typo in the PR title 'Telemtry' which should be 'Telemetry'. While this constant name is correct, ensure consistency in naming throughout the codebase.

Copilot uses AI. Check for mistakes.

@behzad-mir behzad-mir added do-not-merge cni Related to CNI. labels Jul 14, 2025
@tamilmani1989 tamilmani1989 requested review from QxBytes and rbtr July 14, 2025 16:58
Comment on lines 497 to 500
if cnsconfig.TelemetrySettings.AppInsightsInstrumentationKey != "" {
err = tb.CreateAITelemetryHandle(aiConfig, ts.DisableTrace, ts.DisableMetric, ts.DisableEvent)
} else {
logger.Printf("No Application Insights key provided for CNI telemetry service")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check other way and remove else?

if cnsconfig.TelemetrySettings.AppInsightsInstrumentationKey == "" {
  logger.Printf("No Application Insights key provided for CNI telemetry service")
  return
}
err = tb.CreateAITelemetryHandle(aiConfig, ts.DisableTrace, ts.DisableMetric, ts.DisableEvent)

@santhoshmprabhu
Copy link
Contributor

santhoshmprabhu commented Jul 14, 2025

Discussed in the team meeting today - we may want to consider doing telemetry for CNI directly in CNI rather than CNS, to account for scenarios where CNI can't talk to CNS.
Tagging @tamilmani1989 as decided in the call.

@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch from 2389f47 to df332ec Compare July 22, 2025 05:10
@behzad-mir behzad-mir requested a review from a team as a code owner July 22, 2025 05:10
@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch 5 times, most recently from dde3441 to b59b87b Compare July 22, 2025 06:37
@behzad-mir behzad-mir changed the title fix: Fixing CNI Telemtry Service run by CNS Feat: Adding a CNI Telemetry sidecar to CNS pod to replace the azure-vnet-telemetry binary Jul 22, 2025
Copy link
Contributor

@QxBytes QxBytes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on my understanding, this pr replaces azure-vnet-telemetry binary on the node with a sidecar that does the same thing. So CNI would send AI events to this sidecar via a socket if it is present (rather than the azure vnet telemetry long running process). I believe right now azure-vnet-telemetry is still on the vhd (?) so CNI will still attempt to start that azure vnet telemetry binary service. Can you check that there's no race or issue with both CNS and Azure CNI trying to start the telemetry service at the same time?

Makefile Outdated
CNI_TELEMETRY_SIDECAR_DIR = $(REPO_ROOT)/cns/cni-telemetry-sidecar
CNI_TELEMETRY_SIDECAR_BUILD_DIR = $(BUILD_DIR)/cni-telemetry-sidecar
CNI_TELEMETRY_SIDECAR_AI_ID = $(CNI_AI_ID) # Reuse CNI AI ID
CNI_TELEMETRY_SIDECAR_VERSION = $(CNS_VERSION) # Version follows CNS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move the version to where all the other version variables are?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if you're intending this to be like azure ip masq merger with its own release versions etc. you'll need to add the exclusion to the tag above: ACN_VERSION ?= $(shell git describe --exclude "azure-ip-masq-merger*" --exclude "azure-ipam*" --exclude "dropgz*" --exclude "zapai*" --exclude "ipv6-hp-bpf*" --tags --always)

Makefile Outdated
@@ -123,7 +128,7 @@ all-binaries-platforms: ## Make all platform binaries

# OS specific binaries/images
ifeq ($(GOOS),linux)
all-binaries: acncli azure-cni-plugin azure-cns azure-npm azure-ipam azure-ip-masq-merger ipv6-hp-bpf
all-binaries: acncli azure-cni-plugin azure-cns azure-npm azure-ipam azure-ip-masq-merger ipv6-hp-bpf cni-telemetry-sidecar
all-images: npm-image cns-image cni-manager-image azure-ip-masq-merger-image ipv6-hp-bpf-image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like you're making a new image-- if this is like a new sidecar image/container image, there are other dockerfiles/scripts you may need to edit (ex: for the signed pipeline). would the image target be added here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per @rbtr request I moved it to azure-cni image. I removed all the Makefile changes as a result.

Makefile Outdated
@@ -275,6 +284,7 @@ CNI_IMAGE = azure-cni
CNS_IMAGE = azure-cns
NPM_IMAGE = azure-npm
AZURE_IP_MASQ_MERGER_IMAGE = azure-ip-masq-merger
CNI_TELEMETRY_SIDECAR_IMAGE = azure-cni-telemetry-sidecar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if you want to keep the variable naming consistent? it's CNI_TELEMETRY here but the image name is azure-cni-telemetry-sidecar-- is the "sidecar" part of the name necessary (here and elsewhere)? If it's a container image I feel like it's already implied that it'll be running in a sidecar or similar.

Makefile Outdated
# Create a CNI Telemetry Sidecar archive for the target platform.
.PHONY: cni-telemetry-sidecar-archive
cni-telemetry-sidecar-archive: cni-telemetry-sidecar-binary
ifeq ($(GOOS),linux)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this binary is on windows and linux would we omit this linux check?


// Validate batch interval for optimal Azure ingestion
if config.TelemetrySettings.TelemetryBatchIntervalInSecs <= 0 {
logger.Printf("Warning: Invalid telemetry batch interval, using default")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does ai under the hood use the default automatically if it's not valid, or should this also modify the value to the default (also applies to above conditional)?

flag.Parse()

// Initialize logging for the CNI telemetry sidecar
logger.InitLogger("azure-cns-cni-telemetry-sidecar", 1, 1, "/var/log/azure-cns-telemetry")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the naming be the same-- would it be cns or cni telemetry? Like below the logger prints that it's azure cni telemetry sidecar but the log is to the file azure-cns-telemetry

const (
// CNI telemetry constants aligned with azure-vnet-telemetry
cniTelemetryAppName = "azure-vnet-telemetry"
cniTelemetryVersion = "1.0.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we want to start with 0.0.x release train or 1.0.x?

@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch 7 times, most recently from c0a81ad to a949253 Compare July 28, 2025 23:11
@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch 2 times, most recently from 8cf84d3 to 0de7cc1 Compare July 31, 2025 22:50
Copy link
Collaborator

@rbtr rbtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a new image, let's package this in azure-cni - it's already cached and even embedded in most CNS daemonsets today

@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch from 0de7cc1 to 6c70dc7 Compare August 2, 2025 01:56
@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch 2 times, most recently from 68a8df3 to e505a24 Compare August 4, 2025 17:31
@behzad-mir
Copy link
Contributor Author

based on my understanding, this pr replaces azure-vnet-telemetry binary on the node with a sidecar that does the same thing. So CNI would send AI events to this sidecar via a socket if it is present (rather than the azure vnet telemetry long running process). I believe right now azure-vnet-telemetry is still on the vhd (?) so CNI will still attempt to start that azure vnet telemetry binary service. Can you check that there's no race or issue with both CNS and Azure CNI trying to start the telemetry service at the same time?

We are aming to run this on Stateless CNI in specific K8s version. So, it will not run the azure-vnet-telemtry.

@behzad-mir behzad-mir requested a review from a team as a code owner August 14, 2025 19:25
@behzad-mir behzad-mir requested a review from rbtr August 14, 2025 19:25
Makefile Outdated
@@ -84,6 +84,10 @@ ACN_PACKAGE_PATH = github.com/Azure/azure-container-networking
CNI_AI_PATH=$(ACN_PACKAGE_PATH)/telemetry.aiMetadata
CNS_AI_PATH=$(ACN_PACKAGE_PATH)/cns/logger.aiMetadata
NPM_AI_PATH=$(ACN_PACKAGE_PATH)/npm.aiMetadata
CNI_TELEMETRY_SIDECAR_DIR = $(REPO_ROOT)/cns/cni-telemetry-sidecar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have azure- prefix?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it be acn-node-telemetry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tamilmani1989 I removed all the Makefile changes and add the binary to the CNI DOCker instead per @rbtr suggestion.

@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch from 8bb002c to 4c8edcf Compare August 14, 2025 19:32
@behzad-mir behzad-mir force-pushed the telemetry-statelessCNIFix branch from 4c8edcf to e9e0c11 Compare August 15, 2025 18:47
Copy link

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

@github-actions github-actions bot added the stale Stale due to inactivity. label Aug 30, 2025
Copy link

github-actions bot commented Sep 7, 2025

Pull request closed due to inactivity.

@github-actions github-actions bot closed this Sep 7, 2025
@github-actions github-actions bot deleted the telemetry-statelessCNIFix branch September 7, 2025 00:01
@rbtr rbtr restored the telemetry-statelessCNIFix branch September 8, 2025 16:09
@rbtr rbtr reopened this Sep 8, 2025
@rbtr rbtr added exempt-stale Keep this fresh and removed stale Stale due to inactivity. labels Sep 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cni Related to CNI. do-not-merge exempt-stale Keep this fresh
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants